
[Hardware][Powerpc] Enable prefix caching and chunked prefill for ppc64le #35081

Merged
bigPYJ1151 merged 5 commits into vllm-project:main from Akashcodes732:feat/enable_pc_cp_ppc on Feb 26, 2026

Conversation

@Akashcodes732 (Contributor) commented Feb 23, 2026

Purpose

Removes the check for POWERPC in vllm/engine/arg_utils.py to enable chunked prefill and prefix caching on ppc64le.

Test Plan and Result

Ran server with prefix caching enabled

vllm bench serve \
    --backend openai \
    --model ibm-granite/granite-3.3-8b-instruct \
    --dataset-name prefix_repetition \
    --num-prompts 100 \
    --prefix-repetition-prefix-len 512 \
    --prefix-repetition-suffix-len 128 \
    --prefix-repetition-num-prefixes 5 \
    --prefix-repetition-output-len 128
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  186.79    
Total input tokens:                      64000     
Total generated tokens:                  11202     
Request throughput (req/s):              0.54      
Output token throughput (tok/s):         59.97     
Peak output token throughput (tok/s):    196.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          402.61    
---------------Time to First Token----------------
Mean TTFT (ms):                          53624.07  
Median TTFT (ms):                        61559.23  
P99 TTFT (ms):                           77142.36  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1205.40   
Median TPOT (ms):                        1087.50   
P99 TPOT (ms):                           3717.79   
---------------Inter-token Latency----------------
Mean ITL (ms):                           1070.85   
Median ITL (ms):                         869.89    
P99 ITL (ms):                            15577.77  

Ran server with prefix caching disabled

vllm serve ibm-granite/granite-3.3-8b-instruct --max-model-len 4096 --max_num_batched_tokens 4096 --no-enable-prefix-caching
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  425.90    
Total input tokens:                      64000     
Total generated tokens:                  11483     
Request throughput (req/s):              0.23      
Output token throughput (tok/s):         26.96     
Peak output token throughput (tok/s):    190.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          177.23    
---------------Time to First Token----------------
Mean TTFT (ms):                          166706.25 
Median TTFT (ms):                        160896.68 
P99 TTFT (ms):                           313166.60 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3161.35   
Median TPOT (ms):                        2086.99   
P99 TPOT (ms):                           19825.83  
---------------Inter-token Latency----------------
Mean ITL (ms):                           2122.36   
Median ITL (ms):                         916.20    
P99 ITL (ms):                            19883.92  
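Comparing the two runs above (same dataset, 100 requests each), enabling prefix caching roughly halves the end-to-end benchmark duration and cuts mean TTFT by about two thirds. A quick check using the numbers pasted above:

```python
# Compare the two serving benchmark runs pasted above (100 requests each).
with_pc = {"duration_s": 186.79, "mean_ttft_ms": 53624.07, "total_tok_s": 402.61}
without_pc = {"duration_s": 425.90, "mean_ttft_ms": 166706.25, "total_tok_s": 177.23}

# End-to-end speedup from enabling prefix caching + chunked prefill.
speedup = without_pc["duration_s"] / with_pc["duration_s"]
# Relative reduction in mean time-to-first-token.
ttft_reduction = 1 - with_pc["mean_ttft_ms"] / without_pc["mean_ttft_ms"]

print(f"End-to-end speedup: {speedup:.2f}x")         # 2.28x
print(f"Mean TTFT reduction: {ttft_reduction:.0%}")  # 68%
```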

Fixed Prompt with Prefix Caching

python benchmarks/benchmark_prefix_caching.py \
 --model ibm-granite/granite-3.3-8b-instruct \
 --enable-prefix-caching \
 --num-prompts 1 \
 --repeat-count 100 \
 --input-length-range 128:256 
Testing filtered requests
------start generating------
Rendering prompts: 100% 100/100 [00:00<00:00, 544.56it/s]
Processed prompts: 100% 100/100 [00:09<00:00, 10.69it/s, est. speed input: 2757.76 toks/s, output: 106.89 toks/s]
cost time 9.541510581970215

ShareGPT Dataset with Prefix Caching

python benchmarks/benchmark_prefix_caching.py \
  --model ibm-granite/granite-3.3-8b-instruct \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --enable-prefix-caching \
  --num-prompts 20 \
  --repeat-count 5 \
  --input-length-range 128:256
Testing filtered requests
------start generating------
Rendering prompts: 100% 100/100 [00:00<00:00, 1102.57it/s]
INFO 02-22 08:39:06 [loggers.py:259] Engine 000: Avg prompt throughput: 201.2 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 39 reqs, Waiting: 61 reqs, GPU KV cache usage: 1.8%, Prefix cache hit rate: 37.9%
INFO 02-22 08:39:24 [loggers.py:259] Engine 000: Avg prompt throughput: 216.8 tokens/s, Avg generation throughput: 5.5 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.8%, Prefix cache hit rate: 55.8%
Processed prompts: 100% 100/100 [00:45<00:00, 2.20it/s, est. speed input: 404.27 toks/s, output: 22.02 toks/s]
cost time 45.508230686187744
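The cache hit rates logged above come from block-level prefix reuse: KV-cache blocks are keyed by a hash of the token prefix, so prompts that share a prefix skip recomputation of the shared blocks. A toy sketch of the idea follows; this is not vLLM's implementation, and the block size and hashing scheme are illustrative.

```python
# Toy sketch of block-level prefix caching (NOT vLLM's implementation).
# KV blocks are keyed by a rolling hash of the token prefix, so repeated
# prefixes hit the cache instead of being recomputed.
from hashlib import sha256

BLOCK = 16  # tokens per block (illustrative)

def block_keys(tokens: list[int]) -> list[str]:
    """Key each full block by a hash of the entire prefix up to that block."""
    keys, running = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        running += str(tokens[i:i + BLOCK]).encode()
        keys.append(sha256(running).hexdigest())
    return keys

cache: set[str] = set()

def prefill(tokens: list[int]) -> int:
    """Return the number of blocks actually computed (cache misses)."""
    misses = 0
    for key in block_keys(tokens):
        if key not in cache:
            cache.add(key)
            misses += 1
    return misses

shared = list(range(64))               # 4 shared prefix blocks
first = prefill(shared + [100] * 16)   # all 5 blocks are cache misses
second = prefill(shared + [200] * 16)  # only the last block is a miss
print(first, second)  # 5 1
```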


Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com>
@gemini-code-assist (bot) left a comment


Code Review

This pull request enables chunked prefill and prefix caching for the PowerPC (ppc64le) architecture by removing the explicit architecture check in the engine configuration. The provided benchmark results demonstrate successful operation and performance improvements on this hardware. However, the refactoring is incomplete as the log messages within the associated code block still incorrectly list 'POWER' as an unsupported architecture, which will lead to misleading information for users on other platforms like s390x or RISC-V.

Comment thread vllm/engine/arg_utils.py Outdated
Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com>
@Akashcodes732 (Contributor, Author)

Hi @bigPYJ1151,

Could you please take a look at this PR?

@bigPYJ1151 bigPYJ1151 self-assigned this Feb 24, 2026
@Akashcodes732 (Contributor, Author)

Hi @bigPYJ1151,

Could you please take a look at the changes?

@mergify (bot) commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Akashcodes732.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 24, 2026
@bigPYJ1151 (Member)

Hi @Akashcodes732, there are some conflicts that need to be resolved :)

Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com>
@mergify mergify bot removed the needs-rebase label Feb 25, 2026
@Akashcodes732 (Contributor, Author)

Hi @bigPYJ1151,

I have fixed the merge conflicts.

@bigPYJ1151 bigPYJ1151 enabled auto-merge (squash) February 25, 2026 08:16
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 25, 2026
@Akashcodes732 (Contributor, Author)

Hi @bigPYJ1151,

The failures look unrelated to this fix; could you please advise?

@bigPYJ1151 bigPYJ1151 disabled auto-merge February 25, 2026 15:37
@bigPYJ1151 bigPYJ1151 enabled auto-merge (squash) February 25, 2026 15:38
@Akashcodes732 (Contributor, Author)

Hi @bigPYJ1151,

I think you need to approve again :)

@bigPYJ1151 bigPYJ1151 merged commit e03ddcf into vllm-project:main Feb 26, 2026
52 checks passed
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
…4le (vllm-project#35081)

Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash kaothalkar <akash.kaothalkar@ibm.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed


2 participants